06-More data transformation

Today we’ll be working with a dataset that I downloaded from the United States Drought Monitor.

First, let’s load the tidyverse package. The tidyverse package is a wrapper that loads a number of different useful packages with just one command. In particular, it loads the ggplot2, dplyr and tidyr packages which we will be using. (Install the package with install.packages("tidyverse") if you have not done so yet.)

library(tidyverse)

Next, run the following line of code to get the dataset as a data frame in R. read.csv() is a function in base R that allows us to read in CSV files from the internet (or from your local disk). We specify the option stringsAsFactors = FALSE because we want character variables to remain as characters.

df <- read.csv("http://web.stanford.edu/~kjytay/courses/stats32-aut2019/Session%206/ca_drought_data_20150920_20190920.csv",
               stringsAsFactors = FALSE)

Examining the dataset

As usual, the str() function gives us a good overview of our dataset:

str(df)

## 'data.frame':    12180 obs. of  13 variables:
##  $ MapDate          : int  20190917 20190910 20190903 20190827 20190820 20190813 20190806 20190730 20190723 20190716 ...
##  $ FIPS             : int  6001 6001 6001 6001 6001 6001 6001 6001 6001 6001 ...
##  $ County           : chr  "Alameda County" "Alameda County" "Alameda County" "Alameda County" ...
##  $ State            : chr  "CA" "CA" "CA" "CA" ...
##  $ None             : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ D0               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ D1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ D2               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ D3               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ D4               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ValidStart       : chr  "2019-09-17" "2019-09-10" "2019-09-03" "2019-08-27" ...
##  $ ValidEnd         : chr  "2019-09-23" "2019-09-16" "2019-09-09" "2019-09-02" ...
##  $ StatisticFormatID: int  2 2 2 2 2 2 2 2 2 2 ...

Before we start using the dataset for our analysis, we should do some sanity checks to make sure that the data we have is indeed the data we expect. For example: California has 58 counties, so we should have all 58 counties represented. We check this using summarize and the function n_distinct.

# Sanity check: should have all 58 counties
df %>% summarize(distinct = n_distinct(County))

##   distinct
## 1       58

Let’s do another sanity check: Since we have 3 years of data, we should expect around \(52 \times 4 = 208\) entries for each county. We check this by counting the number of entries, then checking that 156 is the only value we receive among all the counts:

# Sanity check: each county should have ~156 observations
df %>% group_by(County) %>%
    summarize(count = n()) %>%
    distinct(count)

## # A tibble: 1 x 1
##   count
##   <int>
## 1   210

It’s not 208, but it’s close enough. It’s also a good sign that all the counties have the same number of observaations.

Some thoughts about the fields in our dataset:

MapDate seems to be the same as ValidStart, and ValidEnd is always 6 days later than ValidStart. We can keep only the ValidStart column and treat that as the date which the data in that row is referring to.
Note also that ValidStart is a character variable, even though we want to treat it as a date. We can make this change by using as.Date().
We are not going to use FIPS and StatisticFormatID (we don’t know what they mean anyway), and State is the same for all values in our dataset, so we can safely drop these columns.
The columns from None to D4 always add up to 100%. For us, the percent area in drought is more useful, so we will make a new Drought column for that.

Let’s keep just the rows we want using select in dplyr. Type in the code below (notice how the code below changes the name for column ValidStart at the same time):

# select important rows, change Date to date variable
df_levels <- df %>% select(County, Date = ValidStart, D0:D4) %>%
    mutate(Date = as.Date(Date),
           Drought = D0 + D1 + D2 + D3 + D4)

`gather()`

The chart we want to make display data for just Santa Clara county, so we should use dplyr’s filter to get a new dataset with just the observations from Santa Clara:

# filter for just Santa Clara County
df_county <- df_levels %>% filter(County == "Santa Clara County")

Let’s make a line plot of D0 against Date:

ggplot(data = df_county) +
    geom_line(mapping = aes(x = Date, y = D0))

To make an area plot instead, replace geom_line with geom_area:

ggplot(data = df_county) + 
    geom_area(mapping = aes(x = Date, y = D0))

How can we plot D0 and D1 against Date? We could try this:

ggplot(data = df_county) + 
    geom_area(mapping = aes(x = Date, y = D0)) +
    geom_area(mapping = aes(x = Date, y = D1))

There are a number of issues with this approach:

It doesn’t scale. To display D0 to D4, we need 3 more lines of code. What if there were 10 drought levels instead?
- Both areas are filled in with black and are stacked on top of each other, so we can’t tell the difference between the two.
The y-axis reads D0 even though the values plotted are for D0 and D1.

The reason we are facing this issue is because the values of drought level are column names. We need to reorganize the dataset such that there is a Drought level column with values D0 to D4, and a corresponding Percent of land area which gives the percent of land area in that level of drought (In Hadley Wickham’s terminology, the original dataset is not “tidy” and we have to make it so.) We can use tidyr’s gather function to accomplish this.

# gather drought levels
df_county <- df_county %>% gather(D0:D4, key = "Drought level", 
                                  value = "Percent of land area")

Now we can easily plot D0 to D4 on the same chart (because our column names have spaces in them, we need to surround them with backticks (`) so that R interprets the code properly):

# make area plot
ggplot(data = df_county) + 
    geom_area(mapping = aes(x = Date, y = `Percent of land area`, 
                            fill = `Drought level`))

Let’s use scale_fill_brewer to make the area colors more theme-appropriate, and add a title:

# make area plot
ggplot(data = df_county) + 
    geom_area(mapping = aes(x = Date, y = `Percent of land area`, 
                            fill = `Drought level`)) + 
    labs(title = "Drought levels in Santa Clara County") +
    scale_fill_brewer(palette = "YlOrRd") +
    theme(plot.title = element_text(face = "bold", hjust = 0.5))

(scale_fill_brewer works in the same way as scale_color_brewer, except that it refers to the “fill” aesthetic rather than the “color” aesthetic. Both of these functions work for discrete variables; for continuous variables, use scale_fill_distiller and scale_color_distiller instead.)

Scripting for reuse

The plot we just made was for Santa Clara County. What if we were interested in another county, say Monterey County, instead? Do we have to type up all the code again? No: we could just look for all mentions of “Santa Clara County” and change them to “Monterey County”.

While this works, it is not advised: if “Santa Clara County” was used 50 times in our script, we’d have to find all 50 instances and change them. A better way to do this would be to assign the county to a variable right at the top of the script, and have all mentions of the county in the script be replaced by this variable.

The code below demonstrates this approach. (It assumes that we have already computed the data frame df_levels.) Note that the changes are actually quite minimal. For the plot, we made use of the paste() function to make the plot title dynamic.

county <- "Santa Barbara County"

# filter for the county, reshape data
df_county <- df_levels %>% 
    filter(County == county) %>%
    gather(D0:D4, key = "Drought level", 
           value = "Percent of land area")

# make area plot
ggplot(data = df_county) + 
    geom_area(mapping = aes(x = Date, y = `Percent of land area`, 
                            fill = `Drought level`)) + 
    labs(title = paste("Drought levels in", county)) +
    scale_fill_brewer(palette = "YlOrRd") +
    theme(plot.title = element_text(face = "bold", hjust = 0.5))

`separate()`

Let’s go back to the dataset df_levels. I want a visualization that tells me how much drought the counties faced over the entire period in my dataset. As a proxy for “how much drought”, I will take the mean of the Drought column across the whole period. For the visualization to be easier on the eyes, I will use coord_flip() to flip the x and y axes:

df_levels %>% 
    group_by(County) %>%
    summarize(mean_drought = mean(Drought)) %>%
    ggplot() +
    geom_col(aes(x = County, y = mean_drought)) +
    coord_flip()

Note how the counties are arranged: in alphabetical order, with the smallest at the bottom. This is the default. If we want to arrange them in some other order, we’ll need to learn more about factors (which we will next session).

All the counties have the word “County” behind them. This is not very informative and it takes up space on our plot. We can use separate() from the tidyr package to help us remove them.

df_levels %>% 
    separate(County, into = c("County", "xx"), sep = " ") %>%
    head()

## Warning: Expected 2 pieces. Additional pieces discarded in 2940 rows [1261,
## 1262, 1263, 1264, 1265, 1266, 1267, 1268, 1269, 1270, 1271, 1272, 1273,
## 1274, 1275, 1276, 1277, 1278, 1279, 1280, ...].

##    County     xx       Date D0 D1 D2 D3 D4 Drought
## 1 Alameda County 2019-09-17  0  0  0  0  0       0
## 2 Alameda County 2019-09-10  0  0  0  0  0       0
## 3 Alameda County 2019-09-03  0  0  0  0  0       0
## 4 Alameda County 2019-08-27  0  0  0  0  0       0
## 5 Alameda County 2019-08-20  0  0  0  0  0       0
## 6 Alameda County 2019-08-13  0  0  0  0  0       0

We received a warning! That’s because some county names have more than one word (excluding “County”). To remove just the word “County” at the end, instead of separating on spaces, we can tell separate() to separate the last 7 characters out. The minus sign tells separate that we are counting from the right, not the left:

df_levels %>% 
    separate(County, into = c("County", "xx"), sep = -7) %>%
    head()

##    County      xx       Date D0 D1 D2 D3 D4 Drought
## 1 Alameda  County 2019-09-17  0  0  0  0  0       0
## 2 Alameda  County 2019-09-10  0  0  0  0  0       0
## 3 Alameda  County 2019-09-03  0  0  0  0  0       0
## 4 Alameda  County 2019-08-27  0  0  0  0  0       0
## 5 Alameda  County 2019-08-20  0  0  0  0  0       0
## 6 Alameda  County 2019-08-13  0  0  0  0  0       0

OK that looks like it works. Let’s put the rest of our data processing pipeline behind this:

df_levels %>% 
    separate(County, into = c("County", "xx"), sep = -7) %>%
    group_by(County) %>%
    summarize(mean_drought = mean(Drought)) %>%
    ggplot() +
    geom_col(aes(x = County, y = mean_drought)) +
    coord_flip()

We can add a vertical line and some colors to the plot to make it more informative. Note that in the code we make a horizontal line instead: that’s because coord_flip() will make it vertical.

df_levels %>% 
    separate(County, into = c("County", "xx"), sep = -7) %>%
    group_by(County) %>%
    summarize(mean_drought = mean(Drought)) %>%
    ggplot() +
    geom_col(aes(x = County, y = mean_drought, fill = mean_drought)) +
    geom_hline(yintercept = 50, col = "red", lty = 2, lwd = 1) +
    scale_fill_distiller(palette = "YlOrRd", direction = 1) +
    coord_flip() +
    theme(legend.position = "bottom")

Optional material

More `dplyr` practice

Below are some exercise for you to practice your dplyr skills. All of them use the df_levels dataset as the starting point. Solutions are in the section below.

Select the rows with D4 being 100.
Which counties experienced D4 being 100 on 2015-10-20?
Which counties experienced D4 being 100 on 2015-10-20? Give just the county names.
Which counties experienced 100% land area in D4 at any time? (Hint: instead of n_distinct, try distinct.)
Select rows with date “2018-09-18” and return the entries in descending order of D0.
Which county experienced the most number of weeks with 100% land area in D4?

More `dplyr` practice (solutions)

Select the rows with D4 being 100.

df_levels %>% filter(D4 == 100)

Which counties experienced D4 being 100 on 2015-10-20?

df_levels %>% filter(D4 == 100 & Date == "2015-10-20")

##                  County       Date D0 D1 D2 D3  D4 Drought
## 1         Alpine County 2015-10-20  0  0  0  0 100     100
## 2         Amador County 2015-10-20  0  0  0  0 100     100
## 3      Calaveras County 2015-10-20  0  0  0  0 100     100
## 4      El Dorado County 2015-10-20  0  0  0  0 100     100
## 5         Fresno County 2015-10-20  0  0  0  0 100     100
## 6          Kings County 2015-10-20  0  0  0  0 100     100
## 7         Madera County 2015-10-20  0  0  0  0 100     100
## 8       Mariposa County 2015-10-20  0  0  0  0 100     100
## 9         Merced County 2015-10-20  0  0  0  0 100     100
## 10          Mono County 2015-10-20  0  0  0  0 100     100
## 11        Nevada County 2015-10-20  0  0  0  0 100     100
## 12        Placer County 2015-10-20  0  0  0  0 100     100
## 13        Plumas County 2015-10-20  0  0  0  0 100     100
## 14   San Joaquin County 2015-10-20  0  0  0  0 100     100
## 15 Santa Barbara County 2015-10-20  0  0  0  0 100     100
## 16        Sierra County 2015-10-20  0  0  0  0 100     100
## 17    Stanislaus County 2015-10-20  0  0  0  0 100     100
## 18        Sutter County 2015-10-20  0  0  0  0 100     100
## 19        Tulare County 2015-10-20  0  0  0  0 100     100
## 20      Tuolumne County 2015-10-20  0  0  0  0 100     100
## 21       Ventura County 2015-10-20  0  0  0  0 100     100
## 22          Yuba County 2015-10-20  0  0  0  0 100     100

Which counties experienced D4 being 100 on 2015-10-20? Give just the county names.

df_levels %>% filter(D4 == 100 & Date == "2015-10-20") %>%
    select(County)

##                  County
## 1         Alpine County
## 2         Amador County
## 3      Calaveras County
## 4      El Dorado County
## 5         Fresno County
## 6          Kings County
## 7         Madera County
## 8       Mariposa County
## 9         Merced County
## 10          Mono County
## 11        Nevada County
## 12        Placer County
## 13        Plumas County
## 14   San Joaquin County
## 15 Santa Barbara County
## 16        Sierra County
## 17    Stanislaus County
## 18        Sutter County
## 19        Tulare County
## 20      Tuolumne County
## 21       Ventura County
## 22          Yuba County

Which counties experienced 100% land area in D4 at any time? (Hint: instead of n_distinct, try distinct.)

df_levels %>% filter(D4 == 100) %>%
    distinct(County)

##                  County
## 1         Alpine County
## 2         Amador County
## 3      Calaveras County
## 4      El Dorado County
## 5         Fresno County
## 6          Kings County
## 7         Madera County
## 8       Mariposa County
## 9         Merced County
## 10          Mono County
## 11        Nevada County
## 12        Placer County
## 13        Plumas County
## 14   San Joaquin County
## 15 Santa Barbara County
## 16        Sierra County
## 17    Stanislaus County
## 18        Sutter County
## 19        Tulare County
## 20      Tuolumne County
## 21       Ventura County
## 22          Yuba County

Select rows with date “2018-09-18” and return the entries in descending order of D0.

df_levels %>% filter(Date == "2018-09-18") %>% arrange(desc(D0))

Which county experienced the most number of weeks with 100% land area in D4?

df_levels %>% 
filter(D4 == 100) %>% 
group_by(County) %>% 
summarize(count = n()) %>%
arrange(desc(count))

## # A tibble: 22 x 2
##    County               count
##    <chr>                <int>
##  1 Santa Barbara County    69
##  2 Ventura County          69
##  3 Fresno County           31
##  4 Kings County            31
##  5 Madera County           31
##  6 Mariposa County         31
##  7 Merced County           31
##  8 San Joaquin County      31
##  9 Stanislaus County       31
## 10 Tulare County           31
## # … with 12 more rows

Session info

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
## [5] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.1  
## [9] tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2         RColorBrewer_1.1-2 cellranger_1.1.0  
##  [4] pillar_1.4.2       compiler_3.6.1     tools_3.6.1       
##  [7] zeallot_0.1.0      digest_0.6.20      lubridate_1.7.4   
## [10] jsonlite_1.6       evaluate_0.14      nlme_3.1-140      
## [13] gtable_0.3.0       lattice_0.20-38    pkgconfig_2.0.2   
## [16] rlang_0.4.0        cli_1.1.0          rstudioapi_0.10   
## [19] yaml_2.2.0         haven_2.1.1        xfun_0.9          
## [22] withr_2.1.2        xml2_1.2.2         httr_1.4.1        
## [25] knitr_1.24         vctrs_0.2.0        generics_0.0.2    
## [28] hms_0.5.1          grid_3.6.1         tidyselect_0.2.5  
## [31] glue_1.3.1         R6_2.4.0           fansi_0.4.0       
## [34] readxl_1.3.1       rmarkdown_1.15     modelr_0.1.5      
## [37] magrittr_1.5       backports_1.1.4    scales_1.0.0      
## [40] htmltools_0.3.6    rvest_0.3.4        assertthat_0.2.1  
## [43] colorspace_1.4-1   labeling_0.3       utf8_1.1.4        
## [46] stringi_1.4.3      lazyeval_0.2.2     munsell_0.5.0     
## [49] broom_0.5.2        crayon_1.3.4

06-More data transformation

Kenneth Tay

Oct 10, 2019

Examining the dataset

`gather()`

Scripting for reuse

`separate()`

Optional material

More `dplyr` practice

More `dplyr` practice (solutions)

Session info

06-More data transformation

Kenneth Tay

Oct 10, 2019

Examining the dataset

gather()

Scripting for reuse

separate()

Optional material

More dplyr practice

More dplyr practice (solutions)

Session info

`gather()`

`separate()`

More `dplyr` practice

More `dplyr` practice (solutions)